{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Dataflowr](https://raw.githubusercontent.com/dataflowr/website/master/_assets/dataflowr_logo.png)](https://dataflowr.github.io/website/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "cKEoXRSzUUCv"
   },
   "source": [
    "# Collaborative filtering: refactoring the code\n",
    "-----\n",
    "\n",
    "In this practical, you will need to refactor the code seen during the lesson in order to deal with the [Movielens 1M Dataset](https://grouplens.org/datasets/movielens/1m/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "OeHGO4qbUUC4"
   },
   "source": [
    "## 1. Preparations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "frj5rX9wUUC_"
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import os.path as op\n",
    "import imp\n",
    "import numpy as np\n",
    "\n",
    "from zipfile import ZipFile\n",
    "try:\n",
    "    from urllib.request import urlretrieve\n",
    "except ImportError:  # Python 2 compat\n",
    "    from urllib import urlretrieve\n",
    "\n",
    "# this line need to be changed:\n",
    "data_folder = '/content/'\n",
    "\n",
    "\n",
    "ML_1M_URL = \"http://files.grouplens.org/datasets/movielens/ml-1m.zip\"\n",
    "ML_1M_FILENAME = op.join(data_folder,ML_1M_URL.rsplit('/', 1)[1])\n",
    "ML_1M_FOLDER = op.join(data_folder,'ml-1m')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "utzHaMDsV1Bq"
   },
   "outputs": [],
   "source": [
    "if not op.exists(ML_1M_FILENAME):\n",
    "    print('Downloading %s to %s...' % (ML_1M_URL, ML_1M_FILENAME))\n",
    "    urlretrieve(ML_1M_URL, ML_1M_FILENAME)\n",
    "\n",
    "if not op.exists(ML_1M_FOLDER):\n",
    "    print('Extracting %s to %s...' % (ML_1M_FILENAME, ML_1M_FOLDER))\n",
    "    ZipFile(ML_1M_FILENAME).extractall(data_folder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "hHgLg4jBUUDp"
   },
   "source": [
    "## 2. Data analysis and formating"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZlGyPSBpUUDt"
   },
   "source": [
    "As in the lesson, we start by loading the data with [Python Data Analysis Library](http://pandas.pydata.org/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Qk6Vp83IV1Bv"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "all_ratings = pd.read_csv(op.join(ML_1M_FOLDER, 'ratings.dat'), sep='::',\n",
    "                          names=[\"user_id\", \"item_id\", \"ratings\", \"timestamp\"],engine='python')\n",
    "all_ratings.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "UDPeRguMV1Bz",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "list_movies_names = []\n",
    "list_item_ids = []\n",
    "with open(op.join(ML_1M_FOLDER, 'movies.dat'), encoding = \"ISO-8859-1\") as fp:\n",
    "    for line in fp:\n",
    "        list_item_ids.append(line.split('::')[0])\n",
    "        list_movies_names.append(line.split('::')[1])\n",
    "        \n",
    "movies_names = pd.DataFrame(list(zip(list_item_ids, list_movies_names)), \n",
    "               columns =['item_id', 'item_name']) \n",
    "movies_names.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "6nVdjC0IV1B3"
   },
   "source": [
    "Here we add the title of the movies to the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "21gIXEkXV1B4"
   },
   "outputs": [],
   "source": [
    "movies_names['item_id']=movies_names['item_id'].astype(int)\n",
    "all_ratings['item_id']=all_ratings['item_id'].astype(int)\n",
    "all_ratings = all_ratings.merge(movies_names,on='item_id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_XJlXE1mV1B8"
   },
   "outputs": [],
   "source": [
    "all_ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "enyJuYhUUUEL"
   },
   "source": [
    "The dataframe `all_ratings` contains all the raw data for our problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "3M7-jinQUUEO"
   },
   "outputs": [],
   "source": [
    "#number of entries\n",
    "len(all_ratings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ofVcE781UUEa"
   },
   "outputs": [],
   "source": [
    "all_ratings['ratings'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "tCdV2qzMUUEk"
   },
   "outputs": [],
   "source": [
    "all_ratings['ratings'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "zU9vARwJUUEs"
   },
   "outputs": [],
   "source": [
    "all_ratings['user_id'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "P1QoS71CUUE7"
   },
   "outputs": [],
   "source": [
    "# number of unique users\n",
    "total_user_id = len(all_ratings['user_id'].unique())\n",
    "print(total_user_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-Az4KDFgV1CQ"
   },
   "source": [
    "We see that as in the lesson, the users seem to be indexed from 1 to 6040. Let's check that below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Pj8ZAzmIV1CQ"
   },
   "outputs": [],
   "source": [
    "list_user_id = list(all_ratings['user_id'].unique())\n",
    "list_user_id.sort()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "l9omI-cwV1CT"
   },
   "outputs": [],
   "source": [
    "for i,j in enumerate(list_user_id):\n",
    "    if j != i+1:\n",
    "        print(i,j) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "h-LytszyV1CV"
   },
   "source": [
    "We create a new column `user_num` to get an index from 0 to 6039 for users:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "jE6WVgT-V1CV"
   },
   "outputs": [],
   "source": [
    "all_ratings['user_num'] = all_ratings['user_id'].apply(lambda x :x-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "6HldBddZV1CY"
   },
   "outputs": [],
   "source": [
    "all_ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "W0OThRYhV1Cb"
   },
   "source": [
    "We now look at movies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "l3gfokbNUUFH"
   },
   "outputs": [],
   "source": [
    "all_ratings['item_id'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "8ncInsk_V1Cg"
   },
   "outputs": [],
   "source": [
    "# number of unique rated items\n",
    "total_item_id = len(all_ratings['item_id'].unique())\n",
    "print(total_item_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3gGu2LjiV1Ci"
   },
   "source": [
    "Here there is a clear problem: there are 3706 different movies but the range of `item_id` starts at 1 and ends at 3952. So there are gaps, so the first thing you will need to do is to create a new column `item_num` so that all movies are indexed from 0 to 3705."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "9Zs7e-pyV1Cj"
   },
   "outputs": [],
   "source": [
    "#\n",
    "# your code here\n",
    "#"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "weNJOlzcV1Cl"
   },
   "source": [
    "This function will verify that your result is correct."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ktctMT12V1Cl"
   },
   "outputs": [],
   "source": [
    "def check_ratings_num(df):\n",
    "    item_num = set(df['item_num'])\n",
    "    if item_num == set(range(len(item_num))):\n",
    "        return True\n",
    "    else:\n",
    "        return False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "w4w1XQnfV1Cp"
   },
   "outputs": [],
   "source": [
    "check_ratings_num(all_ratings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "HYa43L6EV1Cr"
   },
   "outputs": [],
   "source": [
    "all_ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Ur4bjuniUUFj"
   },
   "source": [
    "Now we will split the data in _train_, _val_ and _test_ be using a pre-defined function from [scikit-learn](http://scikit-learn.org/stable/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ZT5oxhoGUUFm"
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "ratings_trainval, ratings_test = train_test_split(all_ratings, test_size=0.1, random_state=42)\n",
    "\n",
    "ratings_train, ratings_val = train_test_split(ratings_trainval, test_size=0.1, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Q7JF-d05UUGc"
   },
   "source": [
    "## 3. The model\n",
    "\n",
    "We will now modify a bit the `FactorizationModel` class seen during the lesson. Internally, we will still use the `Model_dot` but now we use the PyTorch dataloader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "tsrfFi1QUUGd"
   },
   "outputs": [],
   "source": [
    "import torch.nn as nn\n",
    "import torch\n",
    "import torch.nn.functional as F\n",
    "import torch.optim as optim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "4dYlOkCzV1C0"
   },
   "outputs": [],
   "source": [
    "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "P3OnBFlOV1C2"
   },
   "outputs": [],
   "source": [
    "def df_2_tensor(df, device):\n",
    "    # return a triplet user_num, item_num, rating from the dataframe\n",
    "    user_num = np.asarray(df['user_num'])\n",
    "    item_num = np.asarray(df['item_num'])\n",
    "    rating = np.asarray(df['ratings'])\n",
    "    return torch.from_numpy(user_num).to(device), torch.from_numpy(item_num).to(device), torch.from_numpy(rating).to(device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "W2UUX0GDV1C4"
   },
   "source": [
    "Below, we construct 3 tensors containing the `user_num`, `item_num` and `rating` for the training set. All tensors have the same shape so that `train_user_num[i]` watched `train_item_num[i]` and gave a rating of `train_rating[i]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "LY7lG3M_V1C5"
   },
   "outputs": [],
   "source": [
    "train_user_num, train_item_num, train_rating = df_2_tensor(ratings_train,device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "SY39426KV1C7"
   },
   "source": [
    "We now do the same thing for the validation and test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "o1bjHQExV1C8"
   },
   "outputs": [],
   "source": [
    "val_user_num, val_item_num, val_rating = df_2_tensor(ratings_val,device)\n",
    "test_user_num, test_item_num, test_rating = df_2_tensor(ratings_test,device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HpkKu7PVUUGp"
   },
   "source": [
    "The code below is taken from the lesson"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "eB6_y1nMUUGq"
   },
   "outputs": [],
   "source": [
    "class ScaledEmbedding(nn.Embedding):\n",
    "    \"\"\"\n",
    "    Embedding layer that initialises its values\n",
    "    to using a normal variable scaled by the inverse\n",
    "    of the embedding dimension.\n",
    "    \"\"\"\n",
    "    def reset_parameters(self):\n",
    "        \"\"\"\n",
    "        Initialize parameters.\n",
    "        \"\"\"\n",
    "\n",
    "        self.weight.data.normal_(0, 1.0 / self.embedding_dim)\n",
    "        if self.padding_idx is not None:\n",
    "            self.weight.data[self.padding_idx].fill_(0)\n",
    "\n",
    "\n",
    "class ZeroEmbedding(nn.Embedding):\n",
    "    \"\"\"\n",
    "    Used for biases.\n",
    "    \"\"\"\n",
    "\n",
    "    def reset_parameters(self):\n",
    "        \"\"\"\n",
    "        Initialize parameters.\n",
    "        \"\"\"\n",
    "\n",
    "        self.weight.data.zero_()\n",
    "        if self.padding_idx is not None:\n",
    "            self.weight.data[self.padding_idx].fill_(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ktXRW3-4UUGt"
   },
   "outputs": [],
   "source": [
    "class DotModel(nn.Module):\n",
    "    \n",
    "    def __init__(self,\n",
    "                 num_users,\n",
    "                 num_items,\n",
    "                 embedding_dim=32):\n",
    "        \n",
    "        super(DotModel, self).__init__()\n",
    "        \n",
    "        self.embedding_dim = embedding_dim\n",
    "        \n",
    "        self.user_embeddings = ScaledEmbedding(num_users, embedding_dim)\n",
    "        self.item_embeddings = ScaledEmbedding(num_items, embedding_dim)\n",
    "        self.user_biases = ZeroEmbedding(num_users, 1)\n",
    "        self.item_biases = ZeroEmbedding(num_items, 1)\n",
    "                \n",
    "        \n",
    "    def forward(self, user_ids, item_ids):\n",
    "        #\n",
    "        # your code\n",
    "        #\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "PlbbwU1ze5d7"
   },
   "outputs": [],
   "source": [
    "net = DotModel(total_user_id,total_item_id).to(device)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ysX9Q9pxiMG4"
   },
   "source": [
    "Now test your network on a small batch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "I20UhBLZjBEs"
   },
   "outputs": [],
   "source": [
    "predictions = net(train_user_num[:5], train_item_num[:5])\n",
    "predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "7sqlTLCyiwzl"
   },
   "outputs": [],
   "source": [
    "def regression_loss(predicted_ratings, observed_ratings):\n",
    "    return ((observed_ratings - predicted_ratings) ** 2).mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Ins38kThjNl0"
   },
   "outputs": [],
   "source": [
    "loss_fn = regression_loss\n",
    "loss = loss_fn(predictions, train_rating[:5])\n",
    "loss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "mYpYDeK0V1DN"
   },
   "source": [
    "Now you need to construct a dataset and a dataloader. For this, you can define a first function taking as arguments the tensors defined above and returning a list; then a second function taking as argument a dataset, the batchsize and a boolean for the shuffling. We will not use anymore the functions `shuffle` and `minibatch` used in the lesson."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "c1VFUdHtV1DO"
   },
   "outputs": [],
   "source": [
    "def tensor_2_dataset(user,item,rating):\n",
    "    # your code here\n",
    "    \n",
    "def make_dataloader(dataset,bs,shuffle):\n",
    "    # your code here\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "h-Y4ruIPV1DS"
   },
   "outputs": [],
   "source": [
    "train_dataset = tensor_2_dataset(train_user_num,train_item_num, train_rating)\n",
    "val_dataset = tensor_2_dataset(val_user_num,val_item_num,val_rating)\n",
    "test_dataset = tensor_2_dataset(test_user_num, test_item_num, test_rating)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "rXvR2apHV1DU"
   },
   "outputs": [],
   "source": [
    "train_dataloader = make_dataloader(train_dataset,1024,True)\n",
    "val_dataloader = make_dataloader(val_dataset,1024, False)\n",
    "test_dataloader = make_dataloader(test_dataset,1024,False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "wz1h-gG-V1DX"
   },
   "source": [
    "Here you need to modify the code seen during the lesson:\n",
    " - remove the batch_size in the init\n",
    " - the fit function should now take as argument a dataloader for the training and a dataloader for the validation. AT the end of each epoch, you run the test method on the validation set. Then you print both the loss on the training set and on the validation set to see if you are overfitting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2ekPPr7SUUG7"
   },
   "outputs": [],
   "source": [
    "class FactorizationModel(object):\n",
    "    \n",
    "    def __init__(self, embedding_dim=32, n_iter=10, l2=0.0,\n",
    "                 learning_rate=1e-2, device=device, net=None, num_users=None,\n",
    "                 num_items=None,random_state=None):\n",
    "        \n",
    "        self._embedding_dim = embedding_dim\n",
    "        self._n_iter = n_iter\n",
    "        self._learning_rate = learning_rate\n",
    "        self._l2 = l2\n",
    "        self._device = device\n",
    "        self._num_users = num_users\n",
    "        self._num_items = num_items\n",
    "        self._net = net\n",
    "        self._optimizer = None\n",
    "        self._loss_func = None\n",
    "        self._random_state = random_state or np.random.RandomState()\n",
    "             \n",
    "        \n",
    "    def _initialize(self):\n",
    "        if self._net is None:\n",
    "            self._net = DotModel(self._num_users, self._num_items, self._embedding_dim).to(self._device)\n",
    "        \n",
    "        self._optimizer = optim.Adam(\n",
    "                self._net.parameters(),\n",
    "                lr=self._learning_rate,\n",
    "                weight_decay=self._l2\n",
    "            )\n",
    "        \n",
    "        self._loss_func = regression_loss\n",
    "        \n",
    "    \n",
    "    @property\n",
    "    def _initialized(self):\n",
    "        return self._optimizer is not None\n",
    "    \n",
    "    def __repr__(self):\n",
    "        return _repr_model(self)\n",
    "    \n",
    "    def fit(self, dataloader, val_dataloader, verbose=True):       \n",
    "        if not self._initialized:\n",
    "            self._initialize()\n",
    "            \n",
    "        for epoch_num in range(self._n_iter):\n",
    "            epoch_loss = 0.0\n",
    "            self._net.train(True)\n",
    "\n",
    "            #\n",
    "            # your code\n",
    "            #\n",
    "                \n",
    "            \n",
    "            epoch_loss = epoch_loss / (minibatch_num + 1)\n",
    "            loss_test = self.test(val_dataloader)\n",
    "\n",
    "            if verbose:\n",
    "                print('Epoch {}: loss_train {}, loss_val {}'.format(epoch_num, epoch_loss,loss_test))\n",
    "        \n",
    "            if np.isnan(epoch_loss) or epoch_loss == 0.0:\n",
    "                raise ValueError('Degenerate epoch loss: {}'\n",
    "                                 .format(epoch_loss))\n",
    "    \n",
    "    \n",
    "    def test(self,dataloader, verbose = False):\n",
    "        self._net.train(False)\n",
    "        L1loss = torch.nn.L1Loss()\n",
    "        test_loss = 0.0\n",
    "        test_mae = 0.0\n",
    "        #\n",
    "        # your code here (mae = mean absolute error)\n",
    "        #\n",
    "                \n",
    "        test_loss = test_loss / (minibatch_num + 1)\n",
    "        test_mae = test_mae / (minibatch_num+1)\n",
    "        if verbose:\n",
    "            print(f\"RMSE: {np.sqrt(test_loss)}, MAE: {test_mae}\")\n",
    "        return loss.item()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qGuzPUNrUUG_"
   },
   "outputs": [],
   "source": [
    "model = FactorizationModel(embedding_dim=50,  # latent dimensionality\n",
    "                                   n_iter=5,  # number of epochs of training\n",
    "                                   learning_rate=5e-4,\n",
    "                                   l2=1e-8,  # strength of L2 regularization\n",
    "                                   num_users=total_user_id,\n",
    "                                   num_items=total_item_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2Bd2CIosUUHJ"
   },
   "outputs": [],
   "source": [
    "model.fit(train_dataloader,val_dataloader)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "HKyc7Bg5UUHR"
   },
   "outputs": [],
   "source": [
    "_= model.test(test_dataloader,True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "yXVIF75IUUHW"
   },
   "source": [
    "Play with the parameter to beat the benchmarks presented here: [Surprise](https://github.com/NicolasHug/Surprise)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "VK3KgG47V1Di"
   },
   "source": [
    "## 4. Best and worst movies\n",
    "\n",
    "Now you need to rank the movies according to their bias. For this, you need to recover the biases of the movies, make a list of the pairs `[name of the movie, its bias]` and then sort this list according to the biases. You can use the method sort of a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "7KDDTQfxV1Di"
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Dc1ZKW9XV1Dl"
   },
   "source": [
    "## 5. PCA of movies' embeddings\n",
    "\n",
    "Now you can also plpay with the embeddings learned by your algorithm for the movies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "HX37hFDsV1Dl"
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "from operator import itemgetter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "oWomt-l-V1Dp"
   },
   "source": [
    "[![Dataflowr](https://raw.githubusercontent.com/dataflowr/website/master/_assets/dataflowr_logo.png)](https://dataflowr.github.io/website/)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "include_colab_link": false,
   "name": "08_collaborative_filtering_1M.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
